## [1] "Omitting 1485 NA records (0.92% of all records)"
## Year quarter nsmiles airport_1
## Min. :2001 Min. :1.000 Min. : 101.0 DAL :174
## 1st Qu.:2005 1st Qu.:1.000 1st Qu.: 840.2 MDW :120
## Median :2009 Median :2.000 Median :1194.0 JFK : 64
## Mean :2008 Mean :2.431 Mean :1181.6 EWR : 56
## 3rd Qu.:2012 3rd Qu.:3.000 3rd Qu.:1617.0 HOU : 51
## Max. :2014 Max. :4.000 Max. :2620.0 LGB : 49
## NA's :5 (Other):971
## airport_2 passengers fare carrier_lg
## MDW :180 Min. : 0.1000 Min. : 63.95 99 : 4
## LGB : 75 1st Qu.: 0.1099 1st Qu.: 216.05 CO : 3
## JFK : 66 Median : 0.2174 Median : 298.00 G4 : 2
## HOU : 64 Mean : 0.7028 Mean : 344.06 AA : 1
## SWF : 64 3rd Qu.: 0.3297 3rd Qu.: 409.22 US : 1
## PIE : 61 Max. :137.0000 Max. :2133.00 (Other): 0
## (Other):975 NA's :1474
## large_ms fare_lg carrier_low lf_ms
## Min. :0.0454 Min. : 95.76 99 : 4 Min. :1
## 1st Qu.:0.0670 1st Qu.:121.76 G4 : 2 1st Qu.:1
## Median :0.0800 Median :196.00 AA : 1 Median :1
## Mean :0.4656 Mean :256.97 3M : 0 Mean :1
## 3rd Qu.:1.0000 3rd Qu.:322.75 9K : 0 3rd Qu.:1
## Max. :1.0000 Max. :618.00 (Other): 0 Max. :1
## NA's :1478 NA's :1478 NA's :1478 NA's :1482
## fare_low
## Min. : 95.76
## 1st Qu.:102.64
## Median :109.52
## Mean :113.09
## 3rd Qu.:121.76
## Max. :134.00
## NA's :1482
## [1] "Year" "quarter" "nsmiles" "airport_1" "airport_2"
## [6] "passengers" "fare" "carrier_lg" "large_ms" "fare_lg"
## [11] "carrier_low" "lf_ms" "fare_low" "haul"
## Classes 'tbl_df' and 'data.frame': 160258 obs. of 14 variables:
## $ Year : num 2001 2001 2001 2001 2001 ...
## $ quarter : num 1 1 1 1 1 1 1 1 1 1 ...
## $ nsmiles : int 1300 151 1308 1347 1351 160 1339 646 654 963 ...
## $ airport_1 : Factor w/ 295 levels "ABE","ABI","ABQ",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ airport_2 : Factor w/ 299 levels "ATL","AUS","AVL",..: 48 51 54 64 106 113 115 160 195 278 ...
## $ passengers : num 0.889 9.444 41.667 0.222 2.222 ...
## $ fare : num 130 298 355 175 322 ...
## $ carrier_lg : Factor w/ 39 levels "3M","99","9K",..: 9 33 33 9 11 32 33 24 32 33 ...
## $ large_ms : num 0.75 0.965 0.421 1 0.7 ...
## $ fare_lg : num 149 304 337 175 358 ...
## $ carrier_low: Factor w/ 49 levels "3M","99","9K",..: 13 42 42 11 29 40 42 29 40 42 ...
## $ lf_ms : num 0.25 0.965 0.421 1 0.2 ...
## $ fare_low : num 74.5 304.1 337.4 174.5 129.3 ...
## $ haul : Factor w/ 2 levels "short","medium": 2 1 2 2 2 1 2 2 2 2 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:1485] 33 95 194 266 270 427 430 432 441 579 ...
## .. ..- attr(*, "names")= chr [1:1485] "33" "95" "194" "266" ...
## $airport_1
## [1] "ABE" "ABI" "ABQ" "ABY" "ACK" "ACT" "ACV" "ACY" "AEX" "AGS" "ALB"
## [12] "ALO" "ALW" "AMA" "APF" "ART" "ASE" "ATL" "ATW" "AUS" "AUW" "AVL"
## [23] "AVP" "AZA" "AZO" "BDL" "BFL" "BGM" "BGR" "BHB" "BHM" "BIL" "BIS"
## [34] "BJI" "BKG" "BLI" "BLV" "BMI" "BNA" "BOI" "BOS" "BPT" "BQK" "BRO"
## [45] "BTM" "BTR" "BTV" "BUF" "BUR" "BWI" "BZN" "CAE" "CAK" "CEC" "CHA"
## [56] "CHI" "CHO" "CHS" "CIC" "CID" "CLD" "CLE" "CLL" "CLT" "CMH" "CMI"
## [67] "CMX" "COD" "COS" "COU" "CPR" "CRP" "CRW" "CSG" "CVG" "CWA" "CYS"
## [78] "DAB" "DAL" "DAY" "DBQ" "DCA" "DEC" "DEN" "DFW" "DHN" "DIK" "DLH"
## [89] "DRO" "DSM" "DTW" "EAT" "EAU" "ECP" "EFD" "EGE" "EKO" "ELM" "ELP"
## [100] "ERI" "EUG" "EVV" "EWN" "EWR" "EYW" "FAR" "FAT" "FAY" "FCA" "FLG"
## [111] "FLL" "FLO" "FNL" "FNT" "FOE" "FSD" "FSM" "FWA" "FYV" "GCC" "GEG"
## [122] "GFK" "GJT" "GNV" "GPT" "GRB" "GRK" "GRR" "GSO" "GSP" "GTF" "GTR"
## [133] "GUC" "GYY" "HDN" "HFD" "HGR" "HHH" "HLN" "HOB" "HOU" "HPN" "HRL"
## [144] "HSV" "HTS" "HVN" "HYA" "IAD" "IAG" "IAH" "ICT" "IDA" "ILE" "ILG"
## [155] "ILM" "IND" "INL" "IPL" "IPT" "ISN" "ISP" "ITH" "JAC" "JAN" "JAX"
## [166] "JFK" "JLN" "JRB" "LAN" "LAS" "LAW" "LAX" "LBB" "LCH" "LCK" "LEB"
## [177] "LEX" "LFT" "LGA" "LGB" "LIT" "LMT" "LNK" "LRD" "LSE" "LWS" "LYH"
## [188] "MAF" "MBS" "MCI" "MCO" "MDT" "MDW" "MEI" "MEM" "MFE" "MFR" "MGM"
## [199] "MHK" "MHT" "MIA" "MKC" "MKE" "MKG" "MLB" "MLI" "MLU" "MMH" "MOB"
## [210] "MOD" "MOT" "MQT" "MRY" "MSN" "MSO" "MSP" "MSY" "MTJ" "MVY" "MYR"
## [221] "NYC" "OAJ" "OAK" "OKC" "OMA" "ONT" "ORD" "ORF" "ORH" "ORL" "OTH"
## [232] "PAH" "PBG" "PBI" "PDX" "PFN" "PGV" "PHF" "PHL" "PHX" "PIA" "PIE"
## [243] "PIH" "PIT" "PLN" "PNS" "PSC" "PSM" "PSP" "PUW" "PVD" "PWM" "RAP"
## [254] "RDD" "RDM" "RDU" "RFD" "RIC" "RNO" "ROA" "ROC" "RST" "RSW" "SAF"
## [265] "SAN" "SAT" "SAV" "SBA" "SBN" "SBP" "SBY" "SCE" "SDF" "SEA" "SFO"
## [276] "SGF" "SGU" "SHV" "SJC" "SJT" "SLC" "SMF" "SNA" "SPI" "SRQ" "STL"
## [287] "SUN" "SWF" "SYR" "TLH" "TOL" "TPA" "TRI" "TYS" "VPS"
##
## $airport_2
## [1] "ATL" "AUS" "AVL" "AVP" "AZA" "BDL" "BFL" "BHM" "BIL" "BKG" "BLI"
## [12] "BMI" "BNA" "BOI" "BOS" "BTR" "BTV" "BUF" "BUR" "BWI" "BZN" "CAE"
## [23] "CAK" "CEC" "CHA" "CHI" "CHO" "CHS" "CIC" "CID" "CLD" "CLE" "CLL"
## [34] "CLT" "CMH" "CMI" "CMX" "COD" "COS" "COU" "CPR" "CRP" "CRW" "CSG"
## [45] "CVG" "CWA" "DAB" "DAL" "DAY" "DBQ" "DCA" "DEC" "DEN" "DFW" "DHN"
## [56] "DIK" "DLH" "DRO" "DSM" "DTW" "EAT" "EAU" "ECP" "EFD" "EGE" "ELM"
## [67] "ELP" "ERI" "EUG" "EVV" "EWN" "EWR" "EYW" "FAR" "FAT" "FAY" "FCA"
## [78] "FLG" "FLL" "FLO" "FNT" "FSD" "FSM" "FWA" "FYV" "GCC" "GCK" "GEG"
## [89] "GFK" "GJT" "GNV" "GPT" "GRB" "GRI" "GRK" "GRR" "GSO" "GSP" "GTF"
## [100] "GTR" "GUC" "HDN" "HFD" "HHH" "HLN" "HOU" "HPN" "HRL" "HSV" "HTS"
## [111] "HVN" "HYA" "IAD" "IAG" "IAH" "ICT" "IDA" "IFP" "ILE" "ILG" "ILM"
## [122] "IND" "IPT" "ISN" "ISP" "ITH" "IYK" "JAC" "JAN" "JAX" "JFK" "JLN"
## [133] "JRB" "LAN" "LAS" "LAW" "LAX" "LBB" "LBE" "LCH" "LEB" "LEX" "LFT"
## [144] "LGA" "LGB" "LIT" "LMT" "LNK" "LRD" "LSE" "LWB" "LWS" "LYH" "MAF"
## [155] "MBS" "MCI" "MCN" "MCO" "MDT" "MDW" "MEI" "MEM" "MFE" "MFR" "MGM"
## [166] "MGW" "MHK" "MHT" "MIA" "MKC" "MKE" "MKG" "MLB" "MLI" "MLU" "MMH"
## [177] "MOB" "MOD" "MOT" "MQT" "MRY" "MSN" "MSO" "MSP" "MSY" "MTJ" "MVY"
## [188] "MYR" "NYC" "OAJ" "OAK" "OKC" "OMA" "ONT" "ORD" "ORF" "ORH" "ORL"
## [199] "OTH" "OXR" "PAH" "PBG" "PBI" "PDX" "PFN" "PGD" "PGV" "PHF" "PHL"
## [210] "PHX" "PIA" "PIE" "PIH" "PIT" "PLN" "PNS" "PQI" "PSC" "PSM" "PSP"
## [221] "PUW" "PVC" "PVD" "PVU" "PWM" "RAP" "RDD" "RDM" "RDU" "RFD" "RHI"
## [232] "RIC" "RKD" "RKS" "RNO" "ROA" "ROC" "ROW" "RST" "RSW" "SAF" "SAN"
## [243] "SAT" "SAV" "SBA" "SBN" "SBP" "SBY" "SCE" "SCK" "SDF" "SEA" "SFB"
## [254] "SFO" "SGF" "SGU" "SHV" "SJC" "SJT" "SLC" "SMF" "SMX" "SNA" "SPI"
## [265] "SPS" "SRQ" "SSI" "STC" "STL" "STS" "SUN" "SUX" "SWF" "SYR" "TEX"
## [276] "TLH" "TOL" "TPA" "TRI" "TSS" "TTN" "TUL" "TUP" "TUS" "TVC" "TWF"
## [287] "TXK" "TYR" "TYS" "UIN" "VGT" "VLD" "VPS" "WAS" "WYS" "XNA" "YKM"
## [298] "YNG" "YUM"
##
## $carrier_lg
## [1] "3M" "99" "9K" "AA" "AS" "AX" "AZ" "B6" "CO" "DH" "DL" "E9" "F9" "FL"
## [15] "G4" "GQ" "HP" "JI" "KS" "LH" "N7" "NJ" "NK" "NW" "OS" "PN" "RP" "SY"
## [29] "TW" "TZ" "U5" "UA" "US" "VX" "WN" "XP" "YX" "YY" "ZV"
##
## $carrier_low
## [1] "3M" "99" "9K" "AA" "AQ" "AS" "AX" "AZ" "B6" "BA" "CO" "DH" "DL" "E9"
## [15] "EY" "F9" "FL" "G4" "GQ" "HP" "IB" "JI" "KS" "LH" "N7" "NH" "NJ" "NK"
## [29] "NW" "OH" "OO" "OS" "PN" "RP" "SN" "SY" "TW" "TZ" "U5" "UA" "UP" "US"
## [43] "VX" "WN" "WS" "YV" "YX" "YY" "ZV"
## Year quarter nsmiles airport_1
## Min. :2001 Min. :1.00 Min. : 67 DCA : 7730
## 1st Qu.:2005 1st Qu.:1.00 1st Qu.: 619 DFW : 7608
## Median :2010 Median :2.00 Median : 962 EWR : 7213
## Mean :2009 Mean :2.47 Mean :1095 IAD : 5596
## 3rd Qu.:2012 3rd Qu.:3.00 3rd Qu.:1476 IAH : 5110
## Max. :2014 Max. :4.00 Max. :2783 JFK : 5071
## (Other):121930
## airport_2 passengers fare carrier_lg
## TPA : 8271 Min. : 0.100 Min. : 17.0 DL :37638
## ORD : 7902 1st Qu.: 7.609 1st Qu.: 184.3 AA :24327
## LGA : 5838 Median : 27.283 Median : 233.2 UA :23099
## MDW : 5660 Mean : 186.968 Mean : 242.0 US :20919
## JFK : 5183 3rd Qu.: 148.152 3rd Qu.: 288.9 WN :19847
## IAD : 4719 Max. :7021.087 Max. :2566.9 CO :13325
## (Other):122685 (Other):21103
## large_ms fare_lg carrier_low lf_ms
## Min. :0.1000 Min. : 17.0 DL :34342 Min. :0.0100
## 1st Qu.:0.5083 1st Qu.: 181.6 AA :28543 1st Qu.:0.2171
## Median :0.6772 Median : 231.5 UA :19720 Median :0.4894
## Mean :0.6883 Mean : 241.8 US :19502 Mean :0.5293
## 3rd Qu.:0.8965 3rd Qu.: 289.4 WN :16374 3rd Qu.:0.8762
## Max. :1.0000 Max. :2566.9 CO :12084 Max. :1.0000
## (Other):29693
## fare_low haul
## Min. : 12.0 short : 27889
## 1st Qu.: 167.3 medium:132369
## Median : 212.9
## Mean : 223.4
## 3rd Qu.: 265.7
## Max. :2566.9
##
Across all of the airport pair markets:
What is the shortest airport pair distance?
## Source: local data frame [1 x 3]
##
## airport_1 airport_2 nsmiles
## 1 MKE ORD 67
I’m curious which airport pairs had the most passengers traveling (on average) across all quarters:
## Source: local data frame [7,824 x 3]
##
## airport_1 airport_2 mean_passengers
## 1 LAX SFO 4853.820
## 2 JFK LAX 4450.665
## 3 LGA ORD 4016.727
## 4 FLL LGA 3635.271
## 5 ATL LGA 3553.825
## 6 JFK SFO 3130.356
## 7 LAX ORD 2823.389
## 8 LAS LAX 2815.348
## 9 FLL JFK 2769.823
## 10 EWR MCO 2764.043
## .. ... ... ...
Now, I’m wanting to see what the most traveled airport pairs were for any given quarter:
## Source: local data frame [7,824 x 3]
##
## airport_1 airport_2 max_passengers
## 1 JFK LAX 7021.087
## 2 FLL LGA 6134.340
## 3 LAX SFO 5829.239
## 4 LGA ORD 5783.587
## 5 JFK SFO 5165.326
## 6 FLL JFK 4974.610
## 7 BOS LGA 4563.407
## 8 LAS LAX 4527.000
## 9 ATL LGA 4494.505
## 10 JFK MCO 4246.630
## .. ... ... ...
## Source: local data frame [1 x 14]
##
## Year quarter nsmiles airport_1 airport_2 passengers fare carrier_lg
## 1 2014 4 2510 JFK LAX 7021.087 412.7212 DL
## Variables not shown: large_ms (dbl), fare_lg (dbl), carrier_low (fctr),
## lf_ms (dbl), fare_low (dbl), haul (fctr)
I’m wondering now how many unique airports there are in the dataset:
## Source: local data frame [338 x 1]
##
## ap
## 1 YUM
## 2 YNG
## 3 YKM
## 4 XNA
## 5 WAS
## 6 UIN
## 7 TXK
## 8 TWF
## 9 TVC
## 10 TUS
## .. ...
## Source: local data frame [338 x 4]
##
## ap sum_passengers avg_passengers avg_fare
## 1 ORD 3715830 339.7486 217.0218
## 2 LGA 3043005 281.5511 233.9431
## 3 DFW 2782758 269.6210 226.9230
## 4 EWR 2414071 226.1213 264.6953
## 5 JFK 2209363 215.4635 234.3288
## 6 TPA 2094884 243.4780 189.0277
## 7 LAX 2054506 549.7743 263.5185
## 8 DCA 1955809 187.9321 249.1613
## 9 MDW 1780883 190.6522 194.8835
## 10 IAH 1721053 181.3353 240.0298
## .. ... ... ... ...
## Source: local data frame [338 x 4]
##
## ap sum_passengers avg_passengers avg_fare
## 1 ATL 1201812.8 1074.9668 192.5350
## 2 MCO 1128439.9 1068.5984 171.9714
## 3 DEN 874282.2 785.5186 202.4002
## 4 LAS 1231757.4 734.0628 196.4701
## 5 PHX 771706.4 633.5849 218.0178
## 6 SEA 643998.6 581.7512 244.1213
## 7 LAX 2054506.4 549.7743 263.5185
## 8 DTW 595011.0 549.4100 209.9226
## 9 MSP 591489.3 536.2550 234.4933
## 10 SAN 519371.6 511.1926 247.7528
## .. ... ... ... ...
I wonder what the most expensive airports to travel from are:
## Source: local data frame [338 x 4]
##
## ap sum_passengers avg_passengers avg_fare
## 1 TSS 4.590513 0.1995875 616.5870
## 2 JRB 1.524128 0.2177326 516.5652
## 3 IYK 2960.440939 38.4472849 480.1068
## 4 DIK 80.115942 11.4451346 456.3364
## 5 ISN 394.069703 13.1356568 452.4509
## 6 SMX 227.339377 7.8392889 424.4812
## 7 SUN 3965.633287 18.0256059 377.8083
## 8 OXR 220.545458 6.3012988 372.9604
## 9 IDA 16632.282650 41.7896549 361.7234
## 10 ASE 32450.613572 49.9240209 359.4056
## .. ... ... ... ...
Moving on to other variables:
## 99%
## 493
The plot looks to have a good normal curve, but with a long tail. The 99th quantile of the dataset is just under 500, so it makes sense to set a limit on the x-axis to better view the histogram of fares
I’m curious what the outliers look like:
## Source: local data frame [1,600 x 14]
##
## Year quarter nsmiles airport_1 airport_2 passengers fare carrier_lg
## 1 2013 2 2583 HPN MRY 0.1098901 2566.860 UA
## 2 2008 1 278 DAL XNA 0.1098901 2156.020 AA
## 3 2011 4 2508 HPN MFR 0.1086957 1824.990 UA
## 4 2008 2 1746 EGE JFK 0.2197802 1813.935 NW
## 5 2003 3 1377 BGM EFD 0.1000000 1645.000 CO
## 6 2007 2 283 JFK PHF 0.1098901 1311.990 US
## 7 2012 1 1372 DAL SBP 0.2197802 1308.550 UA
## 8 2002 4 840 EFD GSP 0.1000000 1292.000 CO
## 9 2011 1 1848 BUR MGM 0.1111111 1255.980 UA
## 10 2014 1 2292 BUR EYW 0.1111111 1248.000 DL
## .. ... ... ... ... ... ... ... ...
## Variables not shown: large_ms (dbl), fare_lg (dbl), carrier_low (fctr),
## lf_ms (dbl), fare_low (dbl), haul (fctr)
Looks like most large fares involve large distances and I’ll want to look at the relationship between fares and distance when I get to bivariate analysis. I also notice that the average passengers per day is under 1 for those city pairs. Another interesting thing to look at is the market share for those city pairs. The DAL<->XNA pair seemed odd because of the low mileage, so I looked on Kayak to see what current prices were and they were all above $500.
I am curious about what the average fare is for airport pairs that have less than 1 person / day on average to see if it is high:
## Source: local data frame [1 x 1]
##
## avgfare
## 1 297.2354
Surprisingly, it isn’t as high as I thought it would be, but does fall beyond the third quartile.
Mileage between airport pairs looks somewhat multimodal with heteroskedasticity or possibly a skewed normal distribution.
Re-scaling x-axis (log10) to view the long tail better:
Now, I’m curious about the difference between the average low and large fares:
## Source: local data frame [1 x 1]
##
## fare_diff
## 1 18.37674
## Source: local data frame [1 x 1]
##
## mean(fare_diff_percentage)
## 1 0.07012206
So, it looks like fares have a variance of 7% of the average fare between the average large and low fares.
Seems like there is a sizable count of carriers with the largest fare that have 100% market share. I’d like to compare that to the rest:
## Source: local data frame [2 x 2]
##
## large_ms == 1 n
## 1 FALSE 147608
## 2 TRUE 12650
## n
## 1 0.08569996
This last one is interesting, since there is a spike at 10%, so I looked at the data to verify that it was good, which it is. The reason for this spike in the data is explained by the DOT:
Note that the “lowest fare carrier” is the carrier with the lowest average fare that has at least a 10 percent share of the traffic in the market, except for markets where only a single carrier has a 10 percent or greater share.
## Source: local data frame [7,270 x 8]
##
## Year quarter nsmiles airport_1 airport_2 carrier_low lf_ms fare_low
## 1 2001 3 696 ABE MDW CO 0.0741 179
## 2 2001 3 655 ABE ORD DL 0.0144 198
## 3 2001 3 1270 ABQ ATL TW 0.0834 143
## 4 2001 3 744 ABQ IAH AA 0.0286 117
## 5 2001 3 677 ABQ LAX HP 0.0853 97
## 6 2001 3 719 ABQ MCI CO 0.0185 99
## 7 2001 3 1122 ABQ MDW AA 0.0152 146
## 8 2001 3 889 ABQ OAK HP 0.0725 119
## 9 2001 3 610 ABQ SAT AA 0.0714 138
## 10 2001 3 718 ALB MDW CO 0.0220 109
## .. ... ... ... ... ... ... ... ...
Finally, I’m curious about what percentage of carriers are both the largest and lowest fare carrier:
## Source: local data frame [1 x 1]
##
## n()/nrow(af)
## 1 0.5591921
There are 160,258 records in the dataset with 13 features (Year, quarter, nsmiles, airport_1, airport_2, passengers, fare, carrier_lg, large_ms, fare_lg, carrier_low, lf_ms, and fare_low). The variables airport_1, airport_2, carrier_lg, and carrier_low are unordered factor variables.
Other observations:
The main features of interest to me in the dataset are fare, nsmiles, and passengers. I’d like to figure out what variables influence the fare. My guess is that distance between airports is a factor and that the activity between airports may have some weight.
Market share may have an impact on the average fare. My guess would be that areas that have high market share have less competition and may have higher fares.
At the time of my univariate analysis I had not created any variables. However, after performing bivariate analysis I thought it would be good to create an additional categorical variable, haul, for segmenting short and medium trips. A medium trip is anything over 500 miles.
Data cleaning was the bulk of the work during my investigation. I saw anomalies in market share, large and low fares, and erroneous data that needed to be marked as NA. Some variable naming changed across the years of the dataset and some fields were combined, so I had to fix those as well. I added the fixup code to the data loading functions. These changes were necessary in order to have consistent data to analyze.
## Year quarter nsmiles passengers
## Year 1.00000000 -0.0387753123 0.127808813 0.0140409715
## quarter -0.03877531 1.0000000000 -0.003134517 0.0007013818
## nsmiles 0.12780881 -0.0031345172 1.000000000 -0.0144414386
## passengers 0.01404097 0.0007013818 -0.014441439 1.0000000000
## fare 0.31893655 -0.0370997782 0.457826614 -0.1829057014
## large_ms -0.02335850 -0.0056159138 -0.343139151 -0.0690767254
## fare_lg 0.29604732 -0.0357017595 0.431890715 -0.1551365111
## lf_ms 0.08091732 -0.0161899485 -0.236037321 -0.0919099302
## fare_low 0.32965613 -0.0215216567 0.376823089 -0.1788191433
## fare large_ms fare_lg lf_ms fare_low
## Year 0.31893655 -0.023358503 0.29604732 0.08091732 0.329656135
## quarter -0.03709978 -0.005615914 -0.03570176 -0.01618995 -0.021521657
## nsmiles 0.45782661 -0.343139151 0.43189072 -0.23603732 0.376823089
## passengers -0.18290570 -0.069076725 -0.15513651 -0.09190993 -0.178819143
## fare 1.00000000 -0.140722104 0.96032828 -0.07386059 0.905739991
## large_ms -0.14072210 1.000000000 -0.13615080 0.69085587 0.004520862
## fare_lg 0.96032828 -0.136150796 1.00000000 -0.11410075 0.873874571
## lf_ms -0.07386059 0.690855870 -0.11410075 1.00000000 0.081461978
## fare_low 0.90573999 0.004520862 0.87387457 0.08146198 1.000000000
##
## Pearson's product-moment correlation
##
## data: af$nsmiles and af$fare
## t = 206.1513, df = 160256, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4539482 0.4616877
## sample estimates:
## cor
## 0.4578266
##
## Pearson's product-moment correlation
##
## data: af$Year and af$fare
## t = 134.7118, df = 160256, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3145317 0.3233276
## sample estimates:
## cor
## 0.3189366
As predicted, there is a strong positive correlation between mileage and fare, but passengers has a negligible relationship to price. Year has a moderate positive correlation with fare.
It is interesting to see that carriers with the largest fare lose market share as the distance increases. My guess is that with the correlation between distance and fare that there is more competition across longer flights because there is more revenue to be gained.
## Source: local data frame [2 x 2]
##
## haul mean(passengers)
## 1 short 212.5549
## 2 medium 181.5771
Interesting to see that short haul trips have more passengers on average than medium trips (e.g. coast-to-coast travel)
Looks like 2012 was the best year for average passengers traveling, a drop in 2013, and a return in 2014.
It’s interesting to see the average fares among the most traveled airports and see which ones are higher or lower than the median fare among the group.
## Source: local data frame [6 x 3]
##
## larger_ap median_fare diff
## 1 LAX 275.4786 42.290852
## 2 EWR 261.3781 28.190415
## 3 VPS 244.9723 11.784548
## 4 DCA 242.0185 8.830777
## 5 XNA 238.5907 5.402982
## 6 IAH 237.1551 3.967411
Well, looks like if you’re in any of the cities above you’re paying more than the median fare across all airports, but it also depends on where you’re going. Also, it’s worth mentioning that these airports serve plenty of markets, which could be a factor.
##
## Call:
## lm(formula = fare ~ nsmiles, data = af)
##
## Residuals:
## Min 1Q Median 3Q Max
## -252.76 -50.20 -9.17 38.68 2232.60
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.741e+02 3.798e-01 458.4 <2e-16 ***
## nsmiles 6.200e-02 3.008e-04 206.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 75.7 on 160256 degrees of freedom
## Multiple R-squared: 0.2096, Adjusted R-squared: 0.2096
## F-statistic: 4.25e+04 on 1 and 160256 DF, p-value: < 2.2e-16
Fare correlates strongly with the distance that is being traveled. There has also been a slow trend towards increasing prices since 2005, but it might be good to compare this to inflation to see if actual costs have increased. Passengers had a negligible relationship to fare.
Based on the \(R^2\) value, distance accounts for 20.96% of the variance in fare. Other features can be incorporated into the model to better explain the variance in price, which I will build on in multivariate analysis.
Yes, market share seems to drop when distance traveled increases. I think that is because there is more competition among longer distance flights, since there is more revenue to be gained.
The strongest relationship I found was between fare and distance.
It’s neat to see the cut-off of distance with DFW and ORD, which makes sense since they’re more central to the country.
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
## Source: local data frame [1 x 1]
##
## short_haul_value
## 1 0.6775934
## Source: local data frame [1 x 1]
##
## med_haul_value
## 1 0.2301141
If you’re traveling under 500 miles, then you can expect to pay $0.68 per mile on average. If you’re traveling over 500 miles, then you can can expect to pay roughly $0.23 per mile on average.
Now, I want to see a better breakdown of fare value per distance traveled:
##
## Formula: fare/nsmiles ~ A/sqrt(nsmiles)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## A 9.6914 0.0125 775.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1875 on 160257 degrees of freedom
##
## Number of iterations to convergence: 1
## Achieved convergence tolerance: 6.06e-11
As you can see there is a lot of variance in cost per mile for travel under 500 miles. However, as a consumer if you travel larger distances you will eventually enjoy much better costs per mile.
With the graph of fare value being non-linear, I now want to revisit fare vs miles traveled with different scales:
I now have a much better linear fit here. Let’s revisit the linear model and see if it improves:
##
## Call:
## lm(formula = I(log10(fare)) ~ I(sqrt(nsmiles)), data = af)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.14930 -0.08221 0.00236 0.08421 1.08870
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.119e+00 1.148e-03 1846.2 <2e-16 ***
## I(sqrt(nsmiles)) 7.545e-03 3.468e-05 217.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.133 on 160256 degrees of freedom
## Multiple R-squared: 0.2279, Adjusted R-squared: 0.2279
## F-statistic: 4.731e+04 on 1 and 160256 DF, p-value: < 2.2e-16
\(R^2\) went up by 1.83% with this change, so we can now better account for the relationship between fares and distance between airports.
## Source: local data frame [10 x 2]
##
## larger_ap median_value
## 1 LAX 5.768249
## 2 TPA 5.688935
## 3 JFK 4.678094
## 4 LGA 4.536045
## 5 IAH 4.080295
## 6 MDW 4.064503
## 7 EWR 4.023544
## 8 DCA 3.921803
## 9 DFW 3.913508
## 10 ORD 3.585313
Here, we can see that LAX (Los Angeles Airport) fares enjoy the most bang for the buck (i.e. miles per dollar) among the larger airports. ORD (Chicago Airport) is among the weakest in terms of miles per dollar. You could consider this if you were moving to either of these two cities and planning to travel quite a bit.
## Source: local data frame [12 x 14]
##
## Year quarter nsmiles airport_1 airport_2 passengers fare carrier_lg
## 1 2002 1 1515 LAS ORD 2074.440 142.7000 N7
## 2 2002 2 1515 LAS ORD 2441.640 139.4100 N7
## 3 2002 3 2066 IAD LAS 320.970 209.5000 N7
## 4 2002 3 1515 LAS ORD 2437.500 127.9900 N7
## 5 2001 2 2248 JFK LAS 2223.626 169.5230 N7
## 6 2001 2 1521 LAS MDW 1936.044 127.0379 N7
## 7 2001 3 1055 DFW LAS 1474.000 138.0000 N7
## 8 2001 3 2248 JFK LAS 2227.000 166.0000 N7
## 9 2001 3 1521 LAS MDW 1628.000 123.0000 N7
## 10 2001 3 2176 LAS MIA 684.000 158.0000 N7
## 11 2001 3 2177 LAS PHL 1297.000 156.0000 N7
## 12 2001 3 414 LAS SFO 2340.000 77.0000 N7
## Variables not shown: large_ms (dbl), fare_lg (dbl), carrier_low (fctr),
## lf_ms (dbl), fare_low (dbl), haul (fctr)
National Airlines gave the best mileage per dollar fares between 2001-2002, but it might explain why they are no longer around, too.
## Source: local data frame [7 x 2]
##
## carrier_lg median_value
## 1 VX 7.246457
## 2 B6 6.754951
## 3 WN 5.078585
## 4 AA 4.472823
## 5 UA 4.424600
## 6 DL 3.969786
## 7 US 3.059363
Historically, it looks like Virgin America and JetBlue Airways are the two airlines that provide the best value for mileage per dollar. If you’re traveling long distances across country, it’s likely you’ll find good fares with those two airlines.
Over the years we have seen an increasing cost per mile traveled, but currently we’re experience a downward trend in cost per mile.
Back to our linear model of log10(fare) ~ sqrt(nsmiles) + additional variables:
## R-squared for model: 0.556
## [1] 0.1008139
## Don't know how to automatically pick scale for object of type AsIs. Defaulting to continuous
With an understanding of the strong relationship between distance and fare it made sense to me to look at other variables from that perspective. After seeing the relationship of fare value (cost per mile) to distance I thought to improve on the linear model that I had started in bivariate analysis.
The cost per mile graph was surprising in that the cost can go from $1 / mi at around a 250 mile distance to less than $0.40 / mi at around a 500 mile distance. The additional analysis into worst and best value airports and carriers was interesting to me.
Yes, I created a linear model using the log10 of fare and the square root of nsmiles. Other variables included are Year and airport pair (i.e. airport_1 and airport_2).
The variables of the linear model account for 55.6% of the variance in airline fares.
The cost per mile for air travel has fluctated over time and was the worst in 2008. Currently, we are experiencing a downward trend in cost per mile traveled.
The plot of mileage per dollar across the ten largest airports (by activity) shows that Los Angeles International Airport (LAX) and Tampa International Airport (TPA) have the best median mileage per dollar spent. O’Hare International Airport (ORD) has the worst median mileage per dollar.
The plot of mileage per dollar across the most popular airlines shows that Virgin America and JetBlue Airways provide the best value in terms of how far you can travel per dollar. US Airways (now owned by American Airlines) is among the worst in terms of mileage per dollar.
The plot of fare value (i.e. cost per mile) with one-way distance between airports roughly follows an inverse square root function. The cost per mile has high variance for travel up to 500 miles.
The DOT Domestic Airline Consumer Airfare Report has quarterly airfare data for 1996-2014. Unfortunately, only 2001-2014 had complete and usable data, so that is what was used by me to amass the 160,258 airport pair records. I started by simply cleaning and aggregating the data. Then, I explored single, double, and multiple variables in various ways to better understand the data. I was surprised that passenger activity did not have much weight on the predictive model of fare, but indirectly that may have been captured in the airport pair. The only strong positive correlation with fare was distance between airports, which was the basis of the linear model created. A secondary influence was Year, since there has been a general trend of increasing prices over time. The linear model improved once I realized that converting a non-linear relationship to a linear relationship (log10(fare) ~ sqrt(nsmiles) would better account for variance. Some limitations with this model are that it is averaged data instead of individual fares and that there are missing factors, such as how many days before the flight that the fare was purchased. If I had powerful computing at my disposal I would want to perform a similar analysis to this one on individual fare data.
Some interesting observations that I made while performing this analysis: